Abstract



We study distorted metrics on binary trees in the context of phylogenetic reconstruction.
Given a binary tree $T$ on $n$ leaves with a path metric $d$, consider the pairwise distances $\{d(u,v)\}$
between leaves. It is well known that these determine the tree and the $d$ length of all edges. Here
we consider distortions $\hat{d}$ of $d$ such that for all leaves $u$ and $v$ it holds that
$|d(u,v) - \hat{d}(u,v)| < f/2$ if either $d(u,v) < M$ or $\hat{d}(u,v) < M$, where $d$ satisfies
$f \le d(e) \le g$ for all edges $e$. Given such distortions we show how to reconstruct in polynomial
time a forest $T_1, \ldots, T_\alpha$ such that the true tree $T$ may be obtained from that forest by
adding $\alpha - 1$ edges and $\alpha - 1 \le 2^{-\Omega(M/g)} n$.

Metric distortions arise naturally in phylogeny, where $d(u,v)$ is defined by the log-det of a
covariance matrix associated with $u$ and $v$. When $u$ and $v$ are "far", the entries of the covariance
matrix are small and therefore $\hat{d}(u,v)$, which is defined by the log-det of an associated
empirical-correlation matrix, may be a bad estimate of $d(u,v)$ even if the correlation matrix is
"close" to the covariance matrix.

Our metric results are used to show how to reconstruct phylogenetic forests with a small
number of trees from sequences of length logarithmic in the size of the tree. Our method also
yields an independent proof that phylogenetic trees can be reconstructed in polynomial time from 
sequences of polynomial length under the standard assumptions in phylogeny. Both the metric 
result and its applications to phylogeny are almost tight. 

1 Introduction 

Reconstructing phylogenies has been a scientific challenge for the last 50 years. We refer the reader
to [?] for general and mathematical background. The standard setting in phylogeny is of trees
where the leaves are labeled by taxa or species. Given aligned sequences at the leaves, we define 
character i to be the collection of letters at position i for all the species. Under the i.i.d. assumption, 
the characters are independent samples from the evolutionary process on the tree. 

The theoretical foundations of most methods used in phylogeny are unsatisfactory. Under the 
standard i.i.d. model, Parsimony is not consistent [?] and is NP hard to compute [?]. The
computational complexity of finding the Maximum likelihood tree is not known and the best bounds
on the amount of data needed are exponential in the number of taxa [19].

Computational complexity and information theory considerations have not played an important 
role in phylogeny in the past, as biologists were mostly interested in reconstructing trees on a small
number (typically at most a few hundred) of species. However, one of the major goals of systematic
biology in the coming decade is to reconstruct phylogenies on millions of species. It is clear that
for such numbers, it is crucial to apply algorithms with low computational complexity. Similarly,
algorithms should use information efficiently. 

In [?] the authors developed the first reconstruction algorithm satisfying two important properties:

• Given a number of characters that is polynomial in the number of taxa, the algorithm finds the
true tree with high probability.

• The running time of the algorithm is polynomial in the number of taxa. 

Variants of this method and generalizations from two-state models to general models appeared
in [?]. In [?] the authors discuss a closely related problem of learning a phylogenetic tree (in the
PAC setting). They developed a PAC learning algorithm for the two-state model. The problem of
PAC-learning the general-state model in polynomial time is still open. See also [?] for an earlier
result on learning phylogenies.

The method developed in [?] is a distance method. Such methods were commonly used in
phylogeny before, but [?] is the first result in which a distance method yields provably good
performance. Distance methods are based on defining a path metric on the tree based on the
evolution model. Then the distance between leaves of the tree is approximated by some distance
between the corresponding sequences at the leaves. In this sense all distance methods in phylogeny
may be viewed as reconstruction methods from distorted metrics.

Given the existence of polynomial time reconstruction algorithms, the next problem is
optimizing the sampling complexity. The number of characters needed (= the length of sequences) is
of great practical importance, as this number is bounded by the underlying biology. It is therefore 
desirable to minimize this number. 

Since there are $2^{\Theta(n \log n)}$ binary trees on $n$ leaves, an easy counting argument yields that the
number of characters needed is at least logarithmic in the number of taxa. Thus we are led to
the following natural problem: is the length of the sequences needed logarithmic in $n$, or is it
polynomial?

In [?] we showed that for a restricted family of models, it is possible to reconstruct phylogenies
from a logarithmic number of characters, if the mutation rates are low (bounded above by some
constant). We also showed that a polynomial number of characters is needed if the mutation rates
are high (bounded below by some constant).

We later (see also [17]) generalized the polynomial lower bound for high mutation rates to
a large family of models. In [?] we analyze another model where logarithmic reconstruction is
achievable for low mutation rates.

The phase transition discussed above is of crucial interest if we wish to reconstruct the whole tree.
However, in some cases, a more modest objective is posed: reconstruct a "large portion" of the tree. 
Practitioners (this was kindly noted to me by J. Felsenstein (2001, private communication) and J. 
Kim (2003, private communication)) have noticed that this problem seems to be much easier than 
the problem of reconstructing the complete tree. 

In this paper we prove that this is indeed the case. 
Definition 1.1. We define the operation of edge adding to a forest as one of the following:

• Add an edge $(u,v)$ connecting two isolated leaves $u$ and $v$.

• Given an edge $(u,v)$ of the forest and an isolated leaf $w$, replace the edge $(u,v)$ by the edges
$(u,w')$, $(w',v)$ and $(w',w)$, where $w'$ is a new vertex.

• Replace the two edges $(u_1,v_1)$ and $(u_2,v_2)$ of the forest by $(u_1,w_1)$, $(w_1,v_1)$, $(u_2,w_2)$, $(w_2,v_2)$
and $(w_1,w_2)$, where $w_1, w_2$ are new vertices.



See Figure 1.

Figure 1: The "edge-adding" operation
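
As an illustration, the three edge-adding operations of Definition 1.1 can be sketched in Python. The representation of a forest as a dictionary of neighbour sets, the function names, and the labels generated for new vertices are assumptions made for this sketch only; they do not come from the text.

```python
from itertools import count

_fresh = count()  # source of labels for new internal vertices (hypothetical scheme)


def new_vertex():
    # Assumes no existing vertex is already called "w0", "w1", ...
    return f"w{next(_fresh)}"


def add_edge_between_leaves(adj, u, v):
    """First operation: join two isolated leaves u and v by an edge."""
    assert not adj[u] and not adj[v], "both leaves must be isolated"
    adj[u].add(v)
    adj[v].add(u)


def attach_leaf_to_edge(adj, u, v, w):
    """Second operation: subdivide edge (u, v) and hang the isolated leaf w on it."""
    assert v in adj[u] and not adj[w]
    m = new_vertex()
    adj[u].remove(v)
    adj[v].remove(u)
    adj[m] = {u, v, w}
    for x in (u, v, w):
        adj[x].add(m)
    return m


def join_two_edges(adj, u1, v1, u2, v2):
    """Third operation: subdivide (u1, v1) and (u2, v2), then connect the new vertices."""
    assert v1 in adj[u1] and v2 in adj[u2]
    w1, w2 = new_vertex(), new_vertex()
    for a, b, m in ((u1, v1, w1), (u2, v2, w2)):
        adj[a].remove(b)
        adj[b].remove(a)
        adj[m] = {a, b}
        adj[a].add(m)
        adj[b].add(m)
    adj[w1].add(w2)
    adj[w2].add(w1)
    return w1, w2
```

Each call merges two components of the forest into one, which is why a forest of $\alpha$ trees needs $\alpha - 1$ such operations to become a single tree.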



We show that under the standard assumptions in phylogeny, for a tree $T$ on $n$ leaves and for
all $\delta > 0$ and $\gamma > 0$ we can reconstruct from $\gamma^{-O_\delta(1)} \log n$ characters a forest
$T_1, \ldots, T_\alpha$ such that $\alpha \le 1 + \gamma n$ and such that $T$ may be obtained from the forest
$T_1, \ldots, T_\alpha$ by adding at most $\alpha - 1$ edges. The reconstruction is performed in polynomial
time and with error bounded by $\delta$.

Note that taking $\gamma$ to be a small constant we obtain that "most" edges of the tree can be
reconstructed from $O(\log n)$ characters. Taking $\gamma = 1/n$, we obtain that the full tree may be
reconstructed from a polynomial number of characters, thus obtaining an independent proof of the
results of [?].

Our results indicate what level of refinement is achievable in reconstructing a phylogenetic tree
given a certain amount of data. We believe that our techniques may also play an important role
in reconstructing the complete phylogenetic tree at low mutation rates from a logarithmic number of
characters, as they allow a very clean divide and conquer approach (see [14]).

The main ingredient of the proof is a metric theorem that can be stated roughly as follows. Let
$T$ be a binary tree and $d$ a path metric on $T$ such that $f \le d(e) \le g$ for all edges $e$. Let $C(T)$ be the
set of leaves of $T$ and $\hat{d} : C(T) \times C(T) \to \mathbb{R}$ a distortion of $d$ that satisfies
$|d(u,v) - \hat{d}(u,v)| < f/2$ if $d(u,v) < M$ or $\hat{d}(u,v) < M$ ($\hat{d}$ typically does not correspond
to a path metric on $T$). We show that we can partition $C(T)$ into $\alpha$ sets $L_1, \ldots, L_\alpha$, where
$\alpha = n 2^{-\Omega(M/g)}$. For $1 \le \beta \le \alpha$ we reconstruct a tree $T_\beta$ such that the leaves of $T_\beta$
are $L_\beta$ and the tree $T$ may be obtained from the forest $(T_\beta)_{\beta \le \alpha}$ by
adding at most $\alpha - 1$ edges.

It is easy to see that the metric theorem is tight up to the constant in the $\Omega$ by considering an
$r$-level 3-regular tree, where the length of all edges is exactly $g$ and $\hat{d}(u,v) = \infty$ if
$d(u,v) > M$. Similar tightness results hold for the number of trees in the forest in the phylogenetic
reconstruction. The proof is more complicated and follows ideas from [?] on lower bounds on the
sampling complexity of phylogenies. The proof is omitted in this extended abstract. Our metric result
may be of independent
interest to other problems where path metrics on trees are considered. 

We now give a high level sketch of the different sections of the paper. 

• The formal definition of the model and the statement of the main results are given in Section 2,
where we also discuss how the metric result implies the result in phylogeny.

• The reduction from the phylogenetic model to distorted metrics is given in Proposition 2.2,
which is an easy reformulation of a large deviation result from [?]. This proposition gives $M$
as a function of the number of samples. The distortion error is assumed to be at most $\varepsilon = f/2$.
It is known that Proposition 2.2 is essentially tight. In other words, distances larger than $M$
are likely to be computed with error larger than $f/2$.

• Given the distortion, it is easy to construct for each leaf $v$ a tree $T_v$ on the set of leaves in the
(M — 7/)/6 (metric)-neighborhood of v. This is done using standard techniques in Phylogeny. 

• The collection of trees $\{T_v\}$ is not the forest we are looking for. First, the trees in this
collection are not disjoint. Second, there may be many trees in this collection (in fact as many as
$n$ different trees). The main task of the paper is to "glue" these trees to form an edge-disjoint
forest. Then we can bound the number of trees in the forest.

• The notion of edge-disjoint trees is studied in Section 3. The results of this section imply that
if a forest of edge-disjoint trees is a refinement of the collection $\{T_v\}$ above then the size of the
forest is $1 + n 2^{-\Omega(M/f)}$.

• The "glueing" algorithm is given in Section 4.

• The final metric result is stated in Section 2.

Acknowledgments: The idea that reconstructing a forest should be "easy" was conceived 
during a talk by Junhyong Kim at the kickoff meeting of the Cipres project. J. Kim said that it 
seems like most edges of phylogenies are easy to reconstruct. I thank him and Tandy Warnow for 
encouragement to work on this problem. 

2 Definitions and main results 

Let $T$ be a tree. Write $V(T)$ for the nodes of $T$, $\mathcal{E}(T)$ for the edges of $T$ and $C(T)$ for the
leaves of $T$. If the tree is rooted, then we denote by $\rho(T)$ the root of $T$. Unless stated otherwise,
all trees are assumed to be binary (all internal degrees are 3) and it is further assumed that $C(T)$ is labeled.

Let $T$ be a tree equipped with a path metric $d : \mathcal{E}(T) \to \mathbb{R}_+$. The symbol $d$ will also
denote the induced metric on $V(T)$:

$$d(v,w) = \sum \{ d(e) : e \in \mathrm{path}(v,w) \}, \qquad (1)$$

for all $v, w \in V(T)$.
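
To make equation (1) concrete, the following Python sketch computes the induced metric from one source vertex by walking the tree and accumulating edge lengths. The dictionary-of-edges input format and the function name are assumptions of this illustration, not part of the text.

```python
def path_metric(edges, source):
    """Return d(source, w) for every vertex w of a tree.

    `edges` maps an undirected edge (x, y) to its length d(e).
    """
    adj = {}
    for (x, y), length in edges.items():
        adj.setdefault(x, []).append((y, length))
        adj.setdefault(y, []).append((x, length))
    dist = {source: 0.0}
    stack = [source]
    while stack:
        x = stack.pop()
        for y, length in adj[x]:
            if y not in dist:  # paths in a tree are unique, so the first visit is final
                dist[y] = dist[x] + length
                stack.append(y)
    return dist
```

Calling this once per leaf gives all pairwise leaf distances $d(u,v)$ used in the rest of the section.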

We will further assume below that the length of all edges is bounded between $f$ and $g$.
In other words, for all $e \in \mathcal{E}(T)$,

$$f \le d(e) \le g. \qquad (2)$$

In applications to phylogeny we are typically given a distortion $\hat{d} : C(T) \times C(T) \to \mathbb{R}_+$ of $d$.
We define an $(\varepsilon, M)$ distortion as follows.

Definition 2.1. Given a tree $T$ equipped with a metric $d$, and two positive numbers $0 < \varepsilon < M$, we
say that $\hat{d} : C(T) \times C(T) \to \mathbb{R}_+ \cup \{\infty\}$ is an $(\varepsilon, M)$ distortion of $d$ if

• $\hat{d}(u,v) = \hat{d}(v,u)$ for all $u$ and $v$ in $C(T)$; i.e. $\hat{d}$ is symmetric.

• If $\hat{d}(u,v) = \infty$, then $d(u,v) > M$.

• If $\hat{d}(u,v) < \infty$, then $|d(u,v) - \hat{d}(u,v)| < \varepsilon$.
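
The three conditions of Definition 2.1 translate directly into a check. The sketch below assumes both $d$ and $\hat{d}$ are given as dictionaries keyed by ordered pairs of distinct leaves, with `math.inf` standing for $\infty$; this data layout and the function name are illustrative choices, not from the text.

```python
import math


def is_distortion(d, d_hat, eps, M):
    """Check whether d_hat is an (eps, M) distortion of d.

    d and d_hat map ordered leaf pairs (u, v), u != v, to distances;
    d_hat may take the value math.inf.
    """
    for (u, v), true_dist in d.items():
        est = d_hat[(u, v)]
        if est != d_hat[(v, u)]:              # symmetry
            return False
        if math.isinf(est):
            if not true_dist > M:             # infinite estimates only for far pairs
                return False
        elif not abs(true_dist - est) < eps:  # finite estimates are eps-accurate
            return False
    return True
```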



It is well known that $d : C(T) \times C(T) \to \mathbb{R}_+$ determines the underlying tree $T$ and the metric
on the edges $d : \mathcal{E}(T) \to \mathbb{R}_+$. Moreover, there exists a polynomial time algorithm to reconstruct $T$.
Similarly, we may recover $T$ from any $(\varepsilon, \infty)$ distortion of $d$ if $\varepsilon < f/2$. Moreover, in this case, we



In our main result we show that given an (e, M) distortion of d, we may recover many of the 
edges of T and a good approximation of the d length of those edges. 

Theorem 2.1. Let $T = (V,E)$ be a binary tree equipped with a metric $d$ satisfying (2). Let
$n = |C(T)|$, so that $|V(T)| = 2n - 2$. Let $\hat{d}$ be an $(\varepsilon, M)$ distortion of $d$ and suppose that
$\varepsilon < f/2$ and that $M > 7\varepsilon$. Then $\hat{d}$ determines a partition $\mathcal{P}$ of $C(T)$ into sets
$L_1, \ldots, L_\alpha$ and a forest $T_1, \ldots, T_\alpha$ such that $L_\beta = V(T_\beta) \cap C(T)$ for all $\beta$ and

• The tree $T$ may be obtained from the forest $T_1, \ldots, T_\alpha$ by adding at most $\alpha - 1$ edges.

• The number of trees a in the forest is at most [1 + ^tj2 2 s J . 

Moreover, 

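• the partition $(L_\beta)_{\beta \le \alpha}$,

• the trees $(T_\beta)_{\beta \le \alpha}$ and

• a function $\hat{d} : \bigcup_{\beta \le \alpha} \mathcal{E}(T_\beta) \to \mathbb{R}_+$ satisfying $|\hat{d}(e) - d(e)| < 2\varepsilon$,

can all be computed from $\hat{d}$ in time polynomial in $n$.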

Mutation models and distances. When reconstructing phylogenies, the data is given as 
sequences at the labeled leaves C(T) and the tree T is unknown. Usually the mutation model is 
defined on a rooted tree while the goal is to reconstruct un-rooted trees (in many models there is no 
way to distinguish a root). 

We let $A$ denote the alphabet in which information is encoded. For example, $A = \{A, C, G, T\}$ for
DNA sequences, $A = \{\text{20 amino acids}\}$ for proteins and $A = \{0, 1\}$ for purine-pyrimidine sequences.
To define the mutation model we assume that all the edges are directed away from the root and
for each edge $e \in \mathcal{E}(T)$ let $M(e)$ be the mutation matrix corresponding to the edge $e$. $M(e)$ is an
$|A| \times |A|$ stochastic matrix. The $(i,j)$'th entry of $M(e)$ is the probability that state $i$ will mutate to
state $j$ along edge $e$. It is assumed that each character evolves down the tree as a Markov chain on
the tree, where $M(e)$ is the transition matrix for edge $e$. The root letter is chosen from some fixed
distribution $\pi$. It is assumed that the characters evolve in an i.i.d. manner - they all come from the
same distribution and each one is independent from all the others.
Two popular examples are the CFN model, where $M(e)$ is a symmetric $2 \times 2$ matrix, and the
Jukes-Cantor model, where $M(e)$ is a $4 \times 4$ matrix with all off-diagonal entries equal.
It turns out that under mild assumptions on the matrices $M(e)$ and the evolution model, the
log-det distance defines a path metric on the tree [18]. We summarize the basic properties of this
distance in the following proposition. Proofs for the CFN and the Jukes-Cantor models have
appeared independently several times. The general case follows from a large deviation estimate in
[?] and is proven in the appendix.

Proposition 2.2. Assume that the matrices $M(e)$ satisfy $e^{-2g'} < \det(M(e)) < e^{-2f}$ for all
$e \in \mathcal{E}(T)$ and that for all nodes $v \in V(T)$ and all letters $a \in A$, the probability that the letter at $v$ is
$a$ is at least $\pi_{\min} > 0$. Let $\varepsilon > 0$.

For every two vertices $u,v \in C(T)$ and $a,b \in A$, let $F^{u,v}_{a,b}$ be the probability that node $u$ has letter
$a$ and node $v$ has letter $b$. Let $\hat{F}^{u,v}_{a,b}$ be the empirical distribution that node $u$ has letter $a$ and node $v$
has letter $b$. Let $d(u,v) = -\log\det(F^{u,v})$ and $\hat{d}(u,v) = -\log\det(\hat{F}^{u,v})$ if $-\log\det(\hat{F}^{u,v}) < M + \varepsilon$, and
$\hat{d}(u,v) = \infty$ if $-\log\det(\hat{F}^{u,v}) > M + \varepsilon$ or $\det(\hat{F}^{u,v}) < 0$. Then

• $d(u,v)$ is a path metric on the tree satisfying $g \ge d(e) \ge f$ for all edges $e$ of the tree (where $g$
depends on $g'$ and $\pi_{\min}$).

• There exists a constant $c$ such that for all $r > 2$, if the number of samples $k$ satisfies
k > {1 _ e -2ey e + l °Sn, (3)

then with probability at least $1 - n^{2-r}$ it holds that $\hat{d}$ is an $(\varepsilon, M)$ distortion of $d$.

The proposition may be used with different values of the parameters $k$ and $M$. Fix $\varepsilon < f/2$.
Taking $M$ to be the diameter of the tree, it gives that all empirical distances are within $\varepsilon$ of the
true distances once $k$ is exponential in $M$. Since $M$ may be as large as $\Omega(gn)$, this gives sampling
complexity $k$ which is exponential in $n$. Taking $M = 100g$, say, would give an $(\varepsilon, 100g)$ distortion
from $k = O(e^{100g} \log n) = O(\log n)$ samples. In the sequel we will use $M$ ranging between $M = O(g)$
and $M = O(g \log n)$ (in particular, typically we will only have a fraction of the distances
within $\varepsilon$ of their true value).
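
The truncated empirical log-det distance of Proposition 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are Python strings of equal length over a small alphabet, the determinant is computed by a hand-written Gaussian elimination, and a value exactly equal to $M + \varepsilon$ (a case the proposition leaves unspecified) is mapped to infinity. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[p][i] == 0.0:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return result


def empirical_logdet(seq_u, seq_v, alphabet, M, eps):
    """Empirical log-det distance between two aligned sequences, truncated at M + eps."""
    k = len(seq_u)
    counts = Counter(zip(seq_u, seq_v))
    # Empirical joint distribution of (letter at u, letter at v).
    F_hat = [[counts[(a, b)] / k for b in alphabet] for a in alphabet]
    D = det(F_hat)
    if D <= 0:
        return math.inf
    value = -math.log(D)
    return value if value < M + eps else math.inf
```

The truncation is what makes the estimate an $(\varepsilon, M)$ distortion rather than a uniformly accurate one: pairs whose estimated distance exceeds the threshold are simply declared far.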

Combining Proposition 2.2 and Theorem 2.1 we obtain

Theorem 2.3. Consider a binary phylogenetic tree $T$, where the log-det distance associated with the
mutation matrices $M(e)$ satisfies the conditions of Proposition 2.2. Then given $k$ satisfying (3), with
$M > 7\varepsilon$ and $\varepsilon < f/2$, we can with probability at least $1 - n^{2-r}$ recover a partition $\mathcal{P}$ of $C(T)$ into
sets $L_1, \ldots, L_\alpha$ and a forest $T_1, \ldots, T_\alpha$ such that $L_\beta = V(T_\beta) \cap C(T)$ for all $\beta$ and

• $T_1, \ldots, T_\alpha$ is a forest that may be obtained from $T$ by removing $\alpha - 1$ edges.

M-e 

• The number of trees a in the forest is at most [1 + ^tj2 2 » J . 
Moreover, we can recover a function $\hat{d} : \bigcup_{\beta \le \alpha} \mathcal{E}(T_\beta) \to \mathbb{R}_+$ satisfying
$|\hat{d}(e) - d(e)| < 2\varepsilon$ for all $e$.

Note that taking $M = O(g)$ to be a large constant and $\varepsilon = f/2$ proves that most edges of the tree
can be recovered from $O(\log n)$ characters. Similarly, taking $M = O(g \log n)$ and $\varepsilon = f/2$, we see
that we can recover the underlying tree from $k = n^{O(1)}$ characters, thus obtaining an independent
proof of the results of [?].

3 Edge disjoint trees 

Edge disjoint trees and edge sharing trees will play a crucial role below. In this section we define 
these notions and discuss some of their basic properties. 

We let $T$ be a binary tree with vertices $V(T)$ and edges $\mathcal{E}(T)$. We let $C(T) \subset V(T)$ be the set of
leaves of $T$ and $n = |C(T)|$ the size of this set. We write $\mathrm{path}_T(x,y)$ for the path (sequence of edges)
connecting $x$ to $y$ in $T$. We will sometimes omit the subscript $T$ and write $\mathrm{path}(x,y)$. We write
$\ell(x,y)$ or $\ell_T(x,y)$ for the number of edges in the path connecting $x$ and $y$. For two sets $A, B \subset V(T)$
we write $\ell_T(A,B) = \min\{\ell(x,y) : x \in A, y \in B\}$.

Removing an edge $e$ from a tree $T$ results in two trees $T_1$ and $T_2$. The split defined by
$e$ is the partition $\{C(T) \cap V(T_1), C(T) \cap V(T_2)\}$ of $C(T)$. We denote by $\Sigma(T)$ the collection of
splits defined by all edges of $T$. It is well known that $\Sigma(T)$ determines $T$ (see e.g. [16, Chapter 3]).
We denote the split $\{A, B\}$ by $A|B$.
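
As a small illustration of the split system, the following Python sketch enumerates the splits of a tree by removing each edge in turn and collecting the leaves on either side. The adjacency-dictionary input and the encoding of a split as a `frozenset` of two `frozenset`s are assumptions of this sketch.

```python
def splits(adj, leaves):
    """Return the set of splits A|B induced on `leaves` by the edges of a tree.

    `adj` maps each vertex to the set of its neighbours.
    """
    leaves = frozenset(leaves)
    result = set()
    for x in adj:
        for y in adj[x]:
            # Collect every vertex reachable from y without crossing the edge (x, y).
            seen, stack = {x, y}, [y]
            while stack:
                z = stack.pop()
                for nb in adj[z]:
                    if nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
            side = frozenset(seen - {x}) & leaves
            result.add(frozenset({side, leaves - side}))
    return result
```

Each edge is visited in both orientations, but the two visits produce the same unordered split, so the result has one element per distinct split.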

Definition 3.1. Let $T$ be a binary tree and $L \subset C(T)$ a set of leaves. The restriction of $T$ to $L$ is
defined as follows. This is the tree whose leaf set is $L$ and whose splits are defined by

$$\Sigma(T|L) := \{A|A' : A = B \cap L,\ A' = B' \cap L \text{ and } B|B' \in \Sigma(T)\}.$$

The restriction of $T$ to $L$ is denoted by $T|L$.

Given two sets $L_1, L_2 \subset C(T)$, we say that the trees $T|L_1, T|L_2$ are edge disjoint if

$$\mathrm{path}_T(u_1,v_1) \cap \mathrm{path}_T(u_2,v_2) = \emptyset,$$

for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$. We say that $T|L_1, T|L_2$ are edge-sharing if they are not edge
disjoint.
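
The edge-disjointness condition can be checked directly from the definition: collect all edges of $T$ lying on a path between two leaves of $L_1$, do the same for $L_2$, and test whether the two edge sets intersect. The sketch below does this for a tree given as an adjacency dictionary; the representation and function names are assumptions of this illustration.

```python
from itertools import combinations


def path_edges(adj, u, v):
    """Edges (as frozensets of endpoints) on the unique path from u to v in a tree."""
    parent = {u: None}
    stack = [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    edges = set()
    while parent[v] is not None:
        edges.add(frozenset((v, parent[v])))
        v = parent[v]
    return edges


def spanned_edges(adj, leaves):
    """Union of path_T(u, v) over all pairs u, v in `leaves`."""
    used = set()
    for u, v in combinations(leaves, 2):
        used |= path_edges(adj, u, v)
    return used


def edge_disjoint(adj, L1, L2):
    """True if no path between leaves of L1 shares an edge with a path between leaves of L2."""
    return not (spanned_edges(adj, L1) & spanned_edges(adj, L2))
```

Some pair of paths intersects exactly when the two unions intersect, so comparing the unions is equivalent to the pairwise condition in the definition.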

The following easy lemma is useful as it shows that edge disjointness does not depend on the 
ambient tree. 

Lemma 3.1. Let $L_1 \cup L_2 \subset L' \subset C(T)$. Then $T|L_1$ and $T|L_2$ are edge disjoint if and only if
$\mathrm{path}_{T|L'}(u_1,v_1) \cap \mathrm{path}_{T|L'}(u_2,v_2) = \emptyset$ for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$.

In particular, $L_1$ and $L_2$ are edge disjoint if and only if
$\mathrm{path}_{T|L_1 \cup L_2}(u_1,v_1) \cap \mathrm{path}_{T|L_1 \cup L_2}(u_2,v_2) = \emptyset$ for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$.

Proof. The second statement follows immediately from the first one by letting $L' = L_1 \cup L_2$.

For the first statement, note that $\mathrm{path}_T(u_1,v_1)$ is obtained from $\mathrm{path}_{T|L'}(u_1,v_1)$ by replacing
each edge $(x,y) \in \mathrm{path}_{T|L'}(u_1,v_1)$ by a sequence of edges $(x = x_1, x_2), \ldots, (x_j, x_{j+1} = y)$. Moreover,
the sequence $(x = x_1, x_2), \ldots, (x_j, x_{j+1} = y)$ depends on the edge $(x,y)$ only and each edge $(x_i, x_{i+1})$
of $T$ appears in the sequence of at most one edge $(x,y)$ of $T|L'$.

It now follows that $\mathrm{path}_{T|L'}(u_1,v_1) \cap \mathrm{path}_{T|L'}(u_2,v_2) = \emptyset$ for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$ if
and only if $\mathrm{path}_T(u_1,v_1) \cap \mathrm{path}_T(u_2,v_2) = \emptyset$ for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$, as needed. ■

Next we state a useful closure property.

Lemma 3.2. If $T|L_1$ and $T|L_2$ are edge sharing and $L_1 \cup L_2 \subset C(T)$, then every edge $e$ of
$T|L_1 \cup L_2$ belongs to a path $\mathrm{path}_{T|L_1 \cup L_2}(u,v)$ where $u,v \in L_1$ or $u,v \in L_2$.

Proof. Suppose otherwise and let $e = (w,w')$ be an edge of $T|L_1 \cup L_2$ that does not belong to any
such path. It follows that all the vertices of $L_1$ are on one side of that edge and all the vertices of
$L_2$ on the other side.

Thus $\mathrm{path}_{T|L_1 \cup L_2}(u_1,v_1) \cap \mathrm{path}_{T|L_1 \cup L_2}(u_2,v_2) = \emptyset$ for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$. This in
turn implies by Lemma 3.1 that $L_1$ and $L_2$ are edge disjoint, in contradiction to our assumption. ■

Lemma 3.3. Suppose that $(T|L_1, T|L_2)$ are edge sharing while $(T|L_1, T|L_3)$ and $(T|L_2, T|L_3)$ are
edge disjoint, and let $L = L_1 \cup L_2$. Then $(T|L, T|L_3)$ are edge disjoint.

Proof. Suppose otherwise. Let $u,u' \in L$ and $v,v' \in L_3$ be such that $\mathrm{path}_{T|L}(u,u') \cap \mathrm{path}_{T|L}(v,v') \ne \emptyset$.
Let $e$ be an edge that belongs to their intersection. By the previous lemma, it follows that there
exist $w,w' \in L_1$ or $w,w' \in L_2$ such that $e \in \mathrm{path}_{T|L}(w,w')$. Now $\mathrm{path}_{T|L}(w,w') \cap \mathrm{path}_{T|L}(v,v')$ is
not empty - in contradiction to the fact that $(T|L_1, T|L_3)$ and $(T|L_2, T|L_3)$ are edge disjoint. ■

We note that for binary trees, the notions of edge disjointness and vertex disjointness coincide. 
Let $T$ be a tree and $L_1, L_2 \subset C(T)$. We say that $T|L_1$ and $T|L_2$ are vertex-disjoint if $\mathrm{path}_T(u_1,v_1)$
and $\mathrm{path}_T(u_2,v_2)$ have no vertices in common for all $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$.

Proposition 3.4. Let $T$ be a tree and let $L_1, L_2 \subset C(T)$. Then if $T|L_1$ and $T|L_2$ are vertex-disjoint,
they are also edge-disjoint.

Let $T$ be a tree where all the internal nodes are of degree 2 or 3 and let $L_1, L_2 \subset C(T)$. Suppose
furthermore that $T|L_1$ and $T|L_2$ do not consist of a single vertex. Then if $T|L_1$ and $T|L_2$ are
edge-disjoint, they are also vertex-disjoint.

Proof. If two paths share an edge they also share the two end points of that edge, so the first claim
follows.

For the second claim, suppose that $T|L_1$ and $T|L_2$ are edge disjoint but have the vertex $v$ in
common. If $v$ is a leaf, then both $T|L_1$ and $T|L_2$ share the edge adjacent to that leaf - a contradiction.

If $v$ is not a leaf, then there are $u_1, v_1 \in L_1$ and $u_2, v_2 \in L_2$ such that $v \in \mathrm{path}_T(u_1,v_1) \cap
\mathrm{path}_T(u_2,v_2)$. But the degree of $v$ is at most 3, therefore the two paths $\mathrm{path}_T(u_1,v_1)$ and
$\mathrm{path}_T(u_2,v_2)$ have non-empty edge intersection - a contradiction. The proof follows. ■

Edge disjoint trees naturally define a forest. 

Lemma 3.5. Let $T$ be a binary tree. Let $L_1, \ldots, L_\alpha$ be a partition of $C(T)$ and let
$(T_\gamma = T|L_\gamma)_{\gamma=1}^{\alpha}$ be a collection of (pairwise) edge disjoint trees. Then the tree $T$ may be
obtained from $T_1, \ldots, T_\alpha$ by adding $\alpha - 1$ edges.

Proof. Note first that if $L = L_1 \cup L_2$ and $T|L_1, T|L_2$ are edge disjoint then $L_1|L_2 \in \Sigma(T)$. Therefore,
in this case, $T$ may be obtained from $T|L_1$ and $T|L_2$ by adding a single edge.

In the general case, define $\ell_T(T|L_\beta, T|L_\gamma)$ by

$$\min\{\ell_T(u,v) : u \in \mathrm{path}_T(u',u''),\ v \in \mathrm{path}_T(v',v''),\ u',u'' \in L_\beta,\ v',v'' \in L_\gamma\}.$$

Take $\beta \ne \gamma$ that minimize the distance $\ell(T|L_\beta, T|L_\gamma)$ among all pairs $(\beta, \gamma)$.

It is easy to see that $T|L_\beta \cup L_\gamma$ is edge disjoint from $T_{\beta'}$ for $\beta' \notin \{\beta, \gamma\}$. The general case follows
by induction.

■ 

We say that an edge $e \in \mathcal{E}(T)$ belongs to $T|L$ if there exist $u,v \in L$ such that $e \in \mathrm{path}_T(u,v)$.

We say that the directed edge $\vec{e}$ belongs to $T|L$ if the edge $e$ belongs to $T|L$. The distance of a
directed edge $\vec{e} = (u_1,u_2)$ to a set of vertices $V' \subset V(T)$ is the minimal length $m - 1$ of a simple
path $(u_1,u_2), \ldots, (u_{m-1},u_m)$ such that $u_m \in V'$. We denote this distance by $\ell_T(\vec{e}, V')$. Finally, let
$\ell^*_T(e, V') = \max(\ell_T(\vec{e}, V'), \ell_T(\overleftarrow{e}, V'))$ (the max is over the two orientations of $e$).
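
The directed distance and $\ell^*_T$ can be computed by a breadth-first search that is forced to start along the given orientation of the edge. The sketch below assumes the tree is an adjacency dictionary; the function names are hypothetical.

```python
from collections import deque


def directed_distance(adj, u1, u2, targets):
    """Number of edges on the shortest simple path that starts with (u1, u2) and ends in `targets`."""
    seen = {u1, u2}              # a simple path starting u1 -> u2 never returns to u1
    queue = deque([(u2, 1)])     # the edge (u1, u2) itself counts as length 1
    while queue:
        x, length = queue.popleft()
        if x in targets:
            return length
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append((y, length + 1))
    return float("inf")


def l_star(adj, u, v, targets):
    """Maximum of the directed distance over the two orientations of the edge (u, v)."""
    return max(directed_distance(adj, u, v, targets),
               directed_distance(adj, v, u, targets))
```

Note that an edge can have an endpoint adjacent to a leaf and still have a large $\ell^*_T$ value, because the maximum is taken over both orientations.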

Lemma 3.6. Let $T$ be a binary tree and $L_1, \ldots, L_\alpha$ be a partition of $C(T)$. Suppose that
$(T|L_\beta)_{\beta=1}^{\alpha}$ is a collection of edge disjoint trees and that for all edges $e \in \mathcal{E}(T)$ with
$\ell^*_T(e, C(T)) \le r$ the edge $e$ belongs to one of the trees $T|L_\beta$. Then $\alpha \le 1 + 30 \times 2^{-r} n$.

Note that $\ell^*_T(e = (u,v), V') \ge \min\{\ell(u,V'), \ell(v,V')\}$ and strict inequality may hold (see Figure
2). Thus the lemma does not follow from the fact that the fraction of vertices at distance $r$ from the
set of leaves is at most $2^{-r}$.



Figure 2: Note that $\ell^*_T(e = (u,v), C(T)) = 3$ while $\min\{\ell(u, C(T)), \ell(v, C(T))\}$ =



Proof. Following the argument of the previous lemma, it is easy to see that each edge we add must
satisfy $\ell^*_T(e, C(T)) > r$. Let $A_m$ be the set of all edges whose $\ell^*$ distance to $C(T)$ is at least $m$. Then
$\alpha - 1 \le |A_r|$. It remains to bound the size of $A_r$.

Let $b(n,m)$ be the maximal possible size of $A_m$ among all binary trees on $n$ leaves. Note that
$b(n,m) = 0$ if $n < 2^m$. Let $T$ be a tree on $n$ leaves and $e$ an edge of $T$. Let $T_1$ and $T_2$ be the
two trees obtained from $T$ by removing the edge $e$. Note that the number of directed edges in $T$ of
$\ell$ distance at least $m$ from $C(T)$ is at most five more than the sum of the number of such edges in
$T_1$ and $T_2$ (we may add at most the new edge and the four new edges adjacent to it).



For every binary tree on $n$ leaves there exists an edge $e$ such that removing the edge $e$ results in
two trees $T_1, T_2$ such that $|C(T_1)| \ge |C(T)|/3$ and $|C(T_2)| \ge |C(T)|/3$. We therefore conclude that

$$b(n,m) \le \max\{b(n_1,m) + b(n_2,m) + 5 : n_1 + n_2 = n \text{ and } n_1 \ge n/3 \text{ and } n_2 \ge n/3\}.$$

It now follows by easy induction that

$$b(n,m) \le \max\{30 \times 2^{-m} n - 5,\ 0\}$$

for all $n$. In particular,

$$\alpha \le 1 + |A_r| \le 1 + \max\{30 \times 2^{-r} n - 5,\ 0\} \le 1 + 30 \times 2^{-r} n,$$

as needed. ■

4 Super-trees for edge sharing trees 

In this section we show how to build the super-tree of a collection of edge-sharing trees. 

Definition 4.1. Let $T$ be a binary tree and $L_1, L_2, \ldots, L_\alpha \subset C(T)$. We say that $T|L_1, \ldots, T|L_\alpha$ are
edge sharing if there is no partition $S_1 \cup S_2$ of $\{1, \ldots, \alpha\}$ such that $T|L_\beta$ and $T|L_\gamma$ are edge disjoint
for all $\beta \in S_1$ and $\gamma \in S_2$.
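
Definition 4.1 says that the graph on $\{1, \ldots, \alpha\}$ whose edges are the edge-sharing pairs is connected. The following Python sketch computes the connected components of such a graph from a pairwise predicate; the zero-based indexing, the `shares` callback and the function names are assumptions of this illustration.

```python
def components(n, shares):
    """Connected components of the graph on range(n) with an edge (b, c) iff shares(b, c)."""
    unseen = set(range(n))
    comps = []
    while unseen:
        start = unseen.pop()
        comp, stack = {start}, [start]
        while stack:
            b = stack.pop()
            linked = {c for c in unseen if shares(b, c)}
            unseen -= linked
            comp |= linked
            stack.extend(linked)
        comps.append(comp)
    return comps


def collection_is_edge_sharing(n, shares):
    """True if no bipartition separates the collection into mutually edge-disjoint parts."""
    return len(components(n, shares)) == 1
```

With a pairwise edge-sharing test plugged in as `shares`, each component is a maximal edge-sharing sub-collection.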

From Lemma 3.3 it follows that

Proposition 4.1. Let $T$ be a binary tree and $L_1, L_2, \ldots, L_\alpha \subset C(T)$. Then $T|L_1, \ldots, T|L_\alpha$ are edge
sharing if and only if there is no partition $S_1 \cup S_2$ of $\{1, \ldots, \alpha\}$ for which
$T|\bigcup_{\beta \in S_1} L_\beta$ and $T|\bigcup_{\gamma \in S_2} L_\gamma$ are edge disjoint.

In the main result of this section we prove the following 

Theorem 4.2. Let $T$ be a binary tree and $L_1, L_2, \ldots, L_\alpha \subset C(T)$ such that $T|L_1, \ldots, T|L_\alpha$ are edge
sharing. Let $L' = \bigcup_{\beta=1}^{\alpha} L_\beta$ and $T' = T|L'$. For $1 \le \beta \le \alpha$, let

$$S(\beta) = \{\gamma : T|L_\beta \text{ and } T|L_\gamma \text{ are edge sharing}\},$$

and let $SL_\beta = \bigcup_{\gamma \in S(\beta)} L_\gamma$. Then

• The tree $T'$ is determined by the trees $(T|SL_\beta)_{\beta=1}^{\alpha}$.

• Moreover, given the trees $(T|SL_\beta)_{\beta=1}^{\alpha}$, there is a polynomial time algorithm that computes $T'$.

Theorem 4.2 states that it is possible to glue together a collection of edge-sharing trees, given
some "local" tree structures.

Lemma 4.3. Assume the setting of Theorem 4.2. Let $e \in \mathcal{E}(T|SL_\beta)$ satisfy that there exist $u,v \in L_\beta$
with $e \in \mathrm{path}_{T|SL_\beta}(u,v)$. Let $A_0|B_0$ be the partition defined by $e$ on $SL_\beta$. Then there exists a unique
partition $A|B \in \Sigma(T')$ such that $A_0 = A \cap SL_\beta$ and $B_0 = B \cap SL_\beta$.

Moreover, for every edge $e' \in \mathcal{E}(T')$, there exist $1 \le \beta \le \alpha$, leaves $u,v \in L_\beta$ and $e \in
\mathrm{path}_{T|SL_\beta}(u,v)$ such that the partition of $T|SL_\beta$ defined by $e$ is given by $A \cap SL_\beta | B \cap SL_\beta$, where
$A|B$ is the partition defined by $e'$.



Proof. Since $T'|SL_\beta = T|SL_\beta$, there exists a partition $A|B$ of $L'$ corresponding to an edge of $T'$
which satisfies $A \cap SL_\beta = A_0$ and $B \cap SL_\beta = B_0$. Thus, in order to prove the first claim, it remains
to show that $A|B$ is unique.

Write $e = (x_1, x_{k+1})$ and let $(x_1,x_2), \ldots, (x_k,x_{k+1})$ be the path corresponding to $e$ in $T'$. Note
that $A|B$ defines a partition satisfying $A_0 = A \cap SL_\beta$ and $B_0 = B \cap SL_\beta$ if and only if $A|B$ corresponds
to one of the edges $(x_m, x_{m+1})$. The last claim follows from the fact that removing an edge of $T'$
that doesn't correspond to any edge in $T|SL_\beta$ induces the trivial partition on $SL_\beta$, and removing an
edge $e'$ that corresponds to the edge $e$ of $T|SL_\beta$ induces the partition defined by $e$.

Therefore, it suffices to show that $k = 1$.

Suppose that $k > 1$. Since the edge $(x_1,x_2)$ is not defined in $T'|SL_\beta$, and the collection $(T|SL_\beta)_{\beta=1}^{\alpha}$
is edge-sharing, it follows that there exist $u,v \in L_\gamma$ such that $(x_1,x_2) \in \mathrm{path}(u,v)$ and
$\gamma \notin S(\beta)$. But the fact that $(x_1,x_2) \in \mathrm{path}(u,v)$ implies that $T|L_\beta$ and $T|L_\gamma$ are edge sharing - a
contradiction. Therefore $k = 1$ and the first claim follows.

For the second claim, note that Lemma 3.2 and Proposition 4.1 imply that for all $e' \in \mathcal{E}(T')$,
there exist $1 \le \beta \le \alpha$ and $u,v \in L_\beta$ such that $e' \in \mathrm{path}_{T'}(u,v)$. By the previous argument, the
edge $e'$ corresponds to a unique edge $e \in \mathcal{E}(T|SL_\beta)$, as needed. ■

Proof of Theorem 4.2. We will show how to reconstruct $T'$ in time polynomial in $n$. Clearly,
it suffices to show how to reconstruct $\Sigma(T')$ in polynomial time. From Lemma 4.3 it follows that in
order to find $\Sigma(T')$ it suffices to find for all $1 \le \beta \le \alpha$ and all edges $e \in \mathcal{E}(T|SL_\beta)$ which satisfy
$e \in \mathrm{path}_{T|SL_\beta}(u,v)$ for $u,v \in L_\beta$:

• The partition $A_0|B_0$ of $SL_\beta$ corresponding to the edge $e$.

• The unique partition $A|B$ of $L'$ satisfying $A_0 = A \cap SL_\beta$ and $B_0 = B \cap SL_\beta$.

Given $T|SL_\beta$ it is trivial to find the partition $A_0|B_0$ of $SL_\beta$ corresponding to $e$. All that remains to
show is how to find the partition $A|B$ corresponding to the edge $e$ in $T'$. This is the unique partition
satisfying $A_0 = A \cap SL_\beta$ and $B_0 = B \cap SL_\beta$.

For $S' \supset S(\beta)$ let $L = \bigcup_{\gamma \in S'} L_\gamma$. Lemma 4.3 allows us to identify the edge $e$ in $T|SL_\beta$ with the
unique edge corresponding to $e$ in $T|L$. We will use this identification below.

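We now give an inductive construction of $A$ and $B$. Let $S_0(\beta) = S(\beta)$ and continue inductively
by letting for $d \ge 1$

• $S_d(\beta) = \{\gamma : \exists \delta \in S_{d-1}(\beta) \text{ s.t. } (L_\delta, L_\gamma) \text{ are edge sharing}\} \setminus \left(\bigcup_{c<d} S_c(\beta)\right)$.

• Let $A_d = A_{d-1} \cup \bigcup_{\gamma \in S_d^A(\beta)} L_\gamma$, where $\gamma \in S_d(\beta)$ belongs to $S_d^A(\beta)$ if the following condition holds:
There exist leaves $u',v' \in A_{d-1}$ and leaves $u,v \in L_\gamma$ such that $\mathrm{path}(u',v') \cap \mathrm{path}(u,v) \ne \emptyset$.

• Similarly, let $B_d = B_{d-1} \cup \bigcup_{\gamma \in S_d^B(\beta)} L_\gamma$, where $\gamma \in S_d(\beta)$ belongs to $S_d^B(\beta)$ if the following
condition holds: There exist leaves $u',v' \in B_{d-1}$ and leaves $u,v \in L_\gamma$ such that $\mathrm{path}(u',v') \cap
\mathrm{path}(u,v) \ne \emptyset$.

The above construction is repeated until $S_d(\beta) = \emptyset$.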

We now prove the validity of the construction. First, from the fact that $T|L_1, \ldots, T|L_\alpha$ are edge
sharing, it follows that $\{1, 2, \ldots, \alpha\} = \bigcup_{d \ge 0} S_d(\beta)$. We write $\bar{S}_d(\beta)$ for $\bigcup_{c \le d} S_c(\beta)$.

Claim 4.4. For all $d > 0$ it holds that $S_d(\beta) = S_d^A(\beta) \cup S_d^B(\beta)$. For all $d \ge 0$, the partition $A_d|B_d$
is the partition of the tree $T|\bigcup_{\gamma \in \bar{S}_d(\beta)} L_\gamma$ defined by the edge $e$.

Proof. The proof is by induction on $d$. The base case $d = 0$ is immediate. For the inductive step
note that under the induction hypothesis for $d - 1$, the partition $A_{d-1}|B_{d-1}$ is the partition induced
by $e$ on the tree $T|\bigcup_{\gamma \in \bar{S}_{d-1}(\beta)} L_\gamma$.

Let $\gamma \in S_d(\beta)$. Clearly $T|L_\gamma$ and $T|A_{d-1} \cup B_{d-1}$ share edges. On the other hand since $e$ is not
an edge of $T|L_\gamma$ it follows that either $T|L_\gamma$ and $T|A_{d-1}$ share edges or $T|L_\gamma$ and $T|B_{d-1}$ share edges.
In the first case $L_\gamma \subset A_d$, while in the second case $L_\gamma \subset B_d$.

The claim follows. ■

The proof of the theorem follows as when the algorithm terminates, the sets A d \B d define the 
desired partition. It is clear that the algorithm described above runs in polynomial time. ■ 

5 Distorted metrics on trees 

In this section we will prove Theorem 2.1. We will assume the edge-length bounds below (i.e. $f \le d(e) \le g$ for all $e \in E$). 
It is helpful to define "balls" with respect to $d$ and $\hat{d}$ as follows. 

\[ B_L(v, r) = \{w \in \mathcal{L}(T) : d(v, w) \le r\}, \qquad B_V(v, r) = \{w \in V(T) : d(v, w) \le r\}. \tag{4} \]

We similarly define $\hat{B}_L(v, r)$ and $\hat{B}_V(v, r)$ with $\hat{d}$ instead of $d$. 
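
As a small illustration, a ball can be read off a table of pairwise distances. The sketch below assumes a nested dictionary `dist` with `dist[v][w]` holding $d(v, w)$ or $\hat{d}(v, w)$ (possibly infinite); the function name `ball` and the closed-ball boundary convention are assumptions made for the example.

```python
def ball(dist, v, r, candidates):
    """Return the set of w in candidates with dist[v][w] <= r.

    dist is a hypothetical nested dict of pairwise distances; entries equal to
    float("inf") (an undefined estimate) are never inside a finite ball.
    """
    return {w for w in candidates if dist[v][w] <= r}
```

Passing the leaf set as `candidates` gives the analogue of $B_L(v, r)$; passing all vertices gives the analogue of $B_V(v, r)$.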
We omit the proof of the following easy Lemma. 

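Lemma 5.1. Let $\hat{d}$ be an $(\varepsilon, M)$ distortion of $d$ and let $r < M$. Consider the sets $L_\beta = \hat{B}_L(v_\beta, r)$ 
for $\beta = 0, \ldots, \alpha$. Suppose that $T|L_0$ and $T|L_\beta$ are edge sharing for $\beta = 1, \ldots, \alpha$. Then for all 
$u, u' \in V(T|\bigcup_{\beta=0}^{\alpha} L_\beta)$ it holds that 

\[ d(u, u') < 6r + 6\varepsilon. \tag{5} \]
Similarly, for all $u, u' \in \bigcup_{\beta=0}^{\alpha} L_\beta$ it holds that 

\[ \hat{d}(u, u') < 6r + 7\varepsilon, \tag{6} \]

if $\hat{d}(u, u') < \infty$. 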

Proof. Equation (6) follows immediately from (5). Let $T' = T|\bigcup_{\beta=0}^{\alpha} L_\beta$. Note that if $\beta \le \alpha$, the 
leaves $u_1, u_2$ belong to $L_\beta$ and $u$ is a vertex on the path connecting $u_1$ and $u_2$, then 

\[ d(u, v_\beta) \le \sup_{w \in L_\beta} d(w, v_\beta) < \sup_{w \in L_\beta} \hat{d}(w, v_\beta) + \varepsilon \le r + \varepsilon \tag{7} \]

(the first inequality follows from the fact that if $v$ is a vertex in a tree with a path metric, then the 
vertex furthest away from $v$ is a leaf). 

Now let $u, u' \in V(T')$. Since the trees $(T|L_\beta)_{\beta \le \alpha}$ are edge sharing, it follows that $u$ belongs to a path 
connecting two points in $L_\gamma$ and $u'$ belongs to a path connecting two points in $L_{\gamma'}$, where $0 \le \gamma, \gamma' \le \alpha$. 
Let $w$ be a vertex that belongs to an edge $e$ such that $e \in \mathrm{path}(w_{\gamma,1}, w_{\gamma,2}) \cap \mathrm{path}(w_{0,1}, w_{0,2})$, 
where $w_{\gamma,1}, w_{\gamma,2} \in L_\gamma$ and $w_{0,1}, w_{0,2} \in L_0$. Define $w'$ similarly. Then by (7) 

\[ d(u, u') \le d(u, v_\gamma) + d(v_\gamma, w) + d(w, v_0) + d(v_0, w') + d(w', v_{\gamma'}) + d(v_{\gamma'}, u') < 6(r + \varepsilon), \]

as needed. ■ 

Proof of Theorem 2.1. Let 

\[ r = \frac{M - 7\varepsilon}{6}, \qquad \mathcal{L} = (v_\beta)_{\beta=1}^{n}, \qquad L_\beta = \hat{B}_L(v_\beta, r). \]

Define a graph $G$ on the set of vertices $\{1, \ldots, n\}$ where the edge $(\beta, \gamma)$ is present if and only if $L_\beta$ 
and $L_\gamma$ are edge-sharing. Let $C_1, \ldots, C_\alpha$ be the partition of $\{1, \ldots, n\}$ into connected components of 
$G$. It is easy to see that the graph $G$ can be computed in polynomial time. Indeed, by Lemma 5.1 it 
follows that if $L_\beta$ and $L_\gamma$ share edges then for all $u \in L_\beta$ and $v \in L_\gamma$ it holds that $\hat{d}(u, v) < M$. For 
sets $L_\beta, L_\gamma$ for which this condition holds, we may easily reconstruct the tree $T|L_\beta \cup L_\gamma$ using the 
4-point method. We can then check if $T|L_\beta$ and $T|L_\gamma$ are edge sharing in $T|L_\beta \cup L_\gamma$. 
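
For readers unfamiliar with the 4-point method, the following is a minimal sketch of its basic step under the standard four-point condition: among the three ways of pairing four leaves, the pairing with the smallest sum of within-pair distances is the split induced by the tree. The function name `quartet_split` and the nested-dictionary distance table `d` are assumptions made for the example, not notation from the paper.

```python
def quartet_split(d, a, b, c, e):
    """Return the pairing ((x, y), (z, w)) minimising d[x][y] + d[z][w].

    d is a hypothetical nested dict of pairwise leaf distances. For an exact
    tree metric with a positive internal edge, the minimising pairing is the
    quartet split of the tree on the leaves a, b, c, e.
    """
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(pairings, key=lambda p: d[p[0][0]][p[0][1]] + d[p[1][0]][p[1][1]])
```

With distorted distances the same rule is applied to $\hat{d}$; it still returns the correct split as long as the distortion is small compared with the internal edge length.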

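By Proposition 4.1 it follows that if $\sigma \ne \tau$ then the trees $T|\bigcup_{\beta \in C_\sigma} L_\beta$ and $T|\bigcup_{\beta \in C_\tau} L_\beta$ are edge 
disjoint. Moreover, if $1 \le \sigma \le \alpha$, then the trees in the collection $(T|L_\beta)_{\beta \in C_\sigma}$ are edge sharing. 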
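
Once the edges of $G$ are known, the components $C_1, \ldots, C_\alpha$ can be computed with any standard method. The sketch below uses a union-find structure; the function name `components` and the edge-list input format are assumptions made for the example.

```python
def components(n, edges):
    """Connected components of a graph on vertices 1..n, given as an edge list."""
    parent = list(range(n + 1))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for b, g in edges:
        parent[find(b)] = find(g)

    groups = {}
    for v in range(1, n + 1):
        groups.setdefault(find(v), set()).add(v)
    # Sort by smallest member so the output order is deterministic.
    return sorted(groups.values(), key=min)
```

The number of sets returned is the number of trees $\alpha$ in the reconstructed forest.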

Note that in the notation of Theorem 4.2, for all $\beta$ and all $u, v \in SL_\beta$ it holds that $d(u, v) < M - \varepsilon$ 
by Lemma 5.1. 

It follows that the trees $T|SL_\beta$ may be easily recovered by the 4-point method. Moreover, for 
every edge $e \in \mathcal{E}(T|SL_\beta)$ we may recover $\hat{d}(e)$ satisfying $|\hat{d}(e) - d(e)| < 2\varepsilon$. 
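
To see where an error of $2\varepsilon$ can come from, consider the standard four-point identity for an internal edge: for a quartet with split $ab|ce$, the internal edge has length $(d(a,c) + d(b,e) - d(a,b) - d(c,e))/2$. Each of the four distances is off by less than $\varepsilon$, so the estimate is off by less than $4\varepsilon/2 = 2\varepsilon$. The sketch below is only an illustration of this identity; the function name `internal_edge_length` and the distance table `d` are assumptions, and the paper does not spell out its own estimator here.

```python
def internal_edge_length(d, a, b, c, e):
    """Length of the internal edge of the quartet with split ab|ce.

    d is a hypothetical nested dict of (possibly distorted) leaf distances.
    """
    return (d[a][c] + d[b][e] - d[a][b] - d[c][e]) / 2
```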

We now use Theorem 4.2 in order to recover the trees $T|\bigcup_{\beta \in C_\sigma} L_\beta$ for all $\sigma$. Moreover, we may 
recover $\hat{d}(e)$ satisfying $|\hat{d}(e) - d(e)| < 2\varepsilon$. 

It remains to bound the number of trees $\alpha$ using Lemma 3.6. Note that if $\ell((u,v), \mathcal{L}(T)) \le p$, 
then there is a path of $p - 1$ edges starting at $u$, avoiding $v$ and ending at $\mathcal{L}(T)$ at a node denoted $u'$. 
Similarly, there is a path of $p - 1$ edges starting at $v$, avoiding $u$ and ending at $\mathcal{L}(T)$ at a node denoted 
$v'$. Note that $d(u', v') \le (2p - 1)g$ and therefore $\hat{d}(u', v') < (2p - 1)g + \varepsilon$. Thus if 

\[ p \le \frac{1}{2} + \frac{M - \varepsilon}{2g} \]

then $\hat{d}(u', v') < M$ and therefore the edge $e = (u, v)$ belongs to one of the trees $T|\bigcup_{\beta \in C_\sigma} L_\beta$. It follows from 
Lemma 3.6 that 

\[ \alpha \le 1 + 30n \times 2^{-\left\lfloor \frac{1}{2} + \frac{M - \varepsilon}{2g} \right\rfloor} \le 1 + \frac{60n}{\sqrt{2}} \times 2^{-\frac{M - \varepsilon}{2g}}. \]

The theorem follows. ■ 
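
As a numerical sanity check, the two forms of the bound on $\alpha$ can be evaluated side by side. The sketch below assumes the bound reads $\alpha \le 1 + 30n \cdot 2^{-\lfloor 1/2 + (M - \varepsilon)/(2g) \rfloor} \le 1 + (60n/\sqrt{2}) \cdot 2^{-(M - \varepsilon)/(2g)}$; the function name `forest_bound` and its argument names are hypothetical.

```python
import math

def forest_bound(n, M, eps, g):
    """Return (floor form, relaxed form) of the assumed bound on alpha."""
    p = math.floor(0.5 + (M - eps) / (2 * g))
    with_floor = 1 + 30 * n * 2.0 ** (-p)
    relaxed = 1 + (60 * n / math.sqrt(2)) * 2.0 ** (-(M - eps) / (2 * g))
    return with_floor, relaxed
```

The relaxed form is never smaller than the floor form, because $\lfloor x \rfloor > x - 1$ and $30\sqrt{2} = 60/\sqrt{2}$. Both decay by a factor of two each time $M$ grows by $2g$.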
A Large deviations for the log-det distance 



In this section, we prove Proposition 2.2. We will use the following large deviation result from [6]. 

Lemma A.1 ([6]). For every two vertices $u, v \in \mathcal{L}(T)$ and $a, b \in \mathcal{A}$, let $F_{a,b}$ be the probability that 
node $u$ has letter $a$ and node $v$ has letter $b$. Let $-d(u, v) = \log\det(F_{a,b})$. Then $d(u, v)$ is a path 
metric on the tree satisfying 

• $d(e) \ge f$ for all edges $e$ of the tree. 

• For all $u, v \in \mathcal{L}(T)$: 

\[ d(u, v) \le -|\mathcal{A}| \log \pi_{\min} - \sum_{e \in \mathrm{path}(u,v)} \log\det(M(e)) \le -|\mathcal{A}| \log \pi_{\min} + 2g\,|\mathrm{path}(u, v)|. \]

Moreover, for $u, v \in \mathcal{L}(T)$ let $\hat{F}_{a,b}$ be the empirical distribution of having $a$ at $u$ and $b$ at $v$ in a 
collection of $k$ samples. Let $\hat{d}(u, v) = -\log\det(\hat{F}_{a,b})$ if $\det(\hat{F}_{a,b}) > 0$ and $\hat{d}(u, v) = \infty$ otherwise. 
Then there exist positive constants $c_1$ and $c_2$ such that 

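\[ P\left[ \left| e^{-d(u,v)} - e^{-\hat{d}(u,v)} \right| > t \right] \le 2 \exp\left( -c_1 k \left( t - \frac{c_2}{\sqrt{k}} \right)_+^2 \right) \tag{8} \]

where $(a)_+ = \max\{0, a\}$. 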
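
The empirical distance $\hat{d}$ in the lemma can be computed directly from two aligned sequences. The sketch below is a pure-Python illustration: the function names `det` and `empirical_logdet_distance` and the string-based input format are assumptions made for the example, and the determinant is computed by plain Gaussian elimination.

```python
import math

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, sign, prod = len(a), 1.0, 1.0
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[piv][i] == 0.0:
            return 0.0
        if piv != i:
            a[i], a[piv] = a[piv], a[i]
            sign = -sign
        prod *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return sign * prod

def empirical_logdet_distance(seq_u, seq_v, alphabet):
    """-log det of the empirical joint letter frequencies, or inf if det <= 0."""
    k = len(seq_u)
    idx = {s: i for i, s in enumerate(alphabet)}
    F = [[0.0] * len(alphabet) for _ in alphabet]
    for a, b in zip(seq_u, seq_v):
        F[idx[a]][idx[b]] += 1.0 / k  # frequency of (a at u, b at v)
    D = det(F)
    return -math.log(D) if D > 0 else math.inf
```

Two identical balanced binary sequences give $\hat{F} = \mathrm{diag}(1/2, 1/2)$ and hence $\hat{d} = \log 4$, while empirically independent sequences give a singular $\hat{F}$ and $\hat{d} = \infty$.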

For the proof see [6, Section 7]. Equation (8) here is equation (49) in [6] up to a change of notation. 
Taking $t = e^{-M-\varepsilon}(1 - e^{-2\varepsilon})$ in (8) we see that if $|e^{-d(u,v)} - e^{-\hat{d}(u,v)}| < t$ and either $d(u, v) < M + \varepsilon$ 
or $\hat{d}(u, v) < M + \varepsilon$, then $|d(u, v) - \hat{d}(u, v)| < \varepsilon$. So if $|e^{-d(u,v)} - e^{-\hat{d}(u,v)}| < t$ for all $u$ and $v$, then $\hat{d}$ 
is an $(\varepsilon, M)$ distortion of $d$. 

Taking k that satisfies (J5J we obtain that the error is at most 
if c (in ©) is sufficiently large. The proof of Proposition 12.21 follows. 

B Lower bounds 

In this section we prove tightness of both the distorted metric result and the phylogenetic 
reconstruction result. 

The tightness of the metric result follows easily by considering the $r$-level 3-regular tree with the 
metric $d$ that assigns length $g$ to all edges of the tree. We let $\hat{d}(u, v) = d(u, v)$ if $d(u, v) \le M$ and 
$\hat{d}(u, v) = \infty$ otherwise. Then $\hat{d}$ is a $(0, M)$ distortion of $d$. 

Define the relation $u \sim v$ if $d(u, v) \le M$. It is easy to see that $\sim$ is an equivalence relation. 
There are $n 2^{-\lfloor M/2g \rfloor}$ equivalence classes for this relation. It is easy to reconstruct the tree on each 
class, but since $\hat{d}(u, v) = \infty$ for $u, v$ which belong to different classes, it is impossible to reconstruct 
any more. This proves the tightness of the number of trees in Theorem 2.1 up to a multiplicative 
constant. 
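
The count of equivalence classes can be checked by brute force on a small instance. In the sketch below a leaf of the $r$-level 3-regular tree is encoded as a tuple whose first entry (in $\{0, 1, 2\}$) picks a child of the root and whose remaining $r - 1$ bits pick children further down; this encoding and the function names are assumptions made for the example. Two leaves whose labels share a prefix of length $\ell$ are at distance $2g(r - \ell)$.

```python
from itertools import product

def leaves(r):
    """All n = 3 * 2**(r-1) leaves, encoded as root-to-leaf choice tuples."""
    return [(b,) + bits for b in range(3) for bits in product((0, 1), repeat=r - 1)]

def leaf_distance(u, v, g):
    """Path distance between two leaves when every edge has length g."""
    r = len(u)
    lcp = 0
    while lcp < r and u[lcp] == v[lcp]:
        lcp += 1
    return 2 * g * (r - lcp)

def count_classes(r, g, M):
    """Number of classes of u ~ v iff d(u, v) <= M, counted via representatives."""
    reps = []
    for u in leaves(r):
        # Valid because ~ is an equivalence relation on the leaves of this tree.
        if not any(leaf_distance(u, w, g) <= M for w in reps):
            reps.append(u)
    return len(reps)
```

For $\lfloor M/2g \rfloor \le r - 1$ the result agrees with $n 2^{-\lfloor M/2g \rfloor}$; for larger $M$ all leaves fall into a single class.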

A similar construction yields the analogous sampling complexity lower bound for phylogenetic 
trees. We fix the model to be the CFN model where the length of each edge is $g$. Thus the mutation 
matrices are given by 

\[ M(e) = \begin{pmatrix} e^{-g} + (1 - e^{-g})/2 & (1 - e^{-g})/2 \\ (1 - e^{-g})/2 & e^{-g} + (1 - e^{-g})/2 \end{pmatrix}. \]

Thus for each edge $(u, v)$, the state of $u$ is copied to $v$ with probability $e^{-g}$. Otherwise, an 
independent uniform state is chosen. 
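
The copy-or-resample description can be checked against the matrix entries numerically. The sketch below (the function name `cfn_matrix` is hypothetical) builds the $2 \times 2$ matrix for a given $g$; its rows sum to one and its determinant equals $e^{-g}$, since $(a - b)(a + b) = e^{-g} \cdot 1$ for the diagonal entry $a$ and off-diagonal entry $b$.

```python
import math

def cfn_matrix(g):
    """2x2 CFN mutation matrix: copy with prob. exp(-g), else resample uniformly."""
    copy = math.exp(-g)
    stay = copy + (1 - copy) / 2   # copied, or resampled to the same state
    flip = (1 - copy) / 2          # resampled to the other state
    return [[stay, flip], [flip, stay]]
```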

Following the arguments of [3] implies that if $v$ is a vertex at graph distance $s$ from the set of leaves, 
then the character value at the leaves below $v$ is independent of the character at $v$ with probability 
at least $1 - 2^s e^{-gs}$. Thus the character at the leaves is independent from all nodes at level $s$ with 
probability at least $1 - 3 \cdot 2^{r-s-1} 2^s e^{-gs} = 1 - n e^{-gs}$. The probability that the former event will 
occur for $k$ characters is at least $p_k = 1 - k n e^{-gs}$. 

Let us assume further that the phylogenetic tree on each of the equivalence classes of the relation 
$\sim$ defined by $u \sim v$ if $d(u, v) \le 2gs$ is given. Then with probability $p_k$, there is no non-trivial 
information about the ancestral relationship except that given by the given $n \cdot 2^{-s}$ trees. 

Note furthermore that $p_k \ge 1 - \delta$ if $k \le \delta e^{gs}/n = \delta e^{M}/n$. This proves the tightness of the condition 
on $k$ in Theorem 2.3 up to a factor 2 in the exponent and a multiplicative $O(n)$ factor. 
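
The arithmetic behind the lower bound is easy to reproduce. The sketch below evaluates $p_k = 1 - k n e^{-gs}$ and the largest number of characters $k = \delta e^{gs}/n$ for which this bound stays at least $1 - \delta$; the function names and argument names are hypothetical.

```python
import math

def independence_probability(k, n, g, s):
    """The bound p_k = 1 - k * n * exp(-g * s); it may be negative for large k."""
    return 1 - k * n * math.exp(-g * s)

def max_characters(delta, n, g, s):
    """Largest k with p_k >= 1 - delta, namely delta * exp(g * s) / n."""
    return delta * math.exp(g * s) / n
```

For instance, with $r = 6$ levels there are $n = 3 \cdot 2^{5} = 96$ leaves, and the count $3 \cdot 2^{r-s-1}$ of level-$s$ nodes times $2^s$ is again $n$, which is the identity used in the text.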



C Variants of the method 

We briefly sketch a few variants of the method which may have practical advantages over the method 
analyzed here. 



C.1 Checking if two balls define edge disjoint trees 

The first stage of the algorithm consists of checking if two balls $\hat{B}_L(v, r)$ and $\hat{B}_L(u, r)$ define two 
edge-sharing trees or two edge-disjoint trees. Most of the work at this stage is devoted to pairs 
of trees that are edge-disjoint. In fact, for most such pairs it would hold that $\hat{d}(u, v) > M$, which 
automatically implies edge-disjointness without additional computation. Thus an efficient way of 
computing the graph $G$ is to first check for each $u$ and $v$ if $\hat{d}(u, v) > M$. Otherwise, we perform 
the test described in the proof of Theorem 2.1. 
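
The shortcut described above amounts to a cheap filter placed in front of the expensive test. In the sketch below the expensive edge-sharing test is passed in as a callback `share_test`, which is a hypothetical stand-in for the procedure from the proof of Theorem 2.1; the function name `build_graph` and the distance table `dhat` are likewise assumptions.

```python
from itertools import combinations

def build_graph(leaves, dhat, M, share_test):
    """Edges of G, calling share_test only for pairs with dhat[u][v] <= M.

    Returns the edge list and the number of expensive tests performed.
    """
    edges, calls = [], 0
    for u, v in combinations(leaves, 2):
        if dhat[u][v] > M:
            continue  # far apart: treated as edge-disjoint with no further work
        calls += 1
        if share_test(u, v):
            edges.append((u, v))
    return edges, calls
```

Pairs with $\hat{d}(u, v) = \infty$ are skipped by the same comparison, so the callback is never invoked on them.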

C.2 Building supertrees for edge disjoint trees 

A lot of computational effort is devoted to building super-trees from collections of edge-sharing trees. 
There are many variants that work here. Instead of the method described in the paper, we can use 
quartets method as in 0. Similarly to one can prove that given a collection of edge sharing 
trees $T_1, \ldots, T_\alpha$, their super-tree is in fact defined by all quartets belonging to the trees $T|SL_\beta$ for 
$1 \le \beta \le \alpha$ via the dyadic closure operator. This may lead to a computationally more efficient 
algorithm than ours. 